The Annals of Statistics 

2009, Vol. 37, No. 6B, 4254-4278 

DOI: 10.1214/09-AOS720 

© Institute of Mathematical Statistics, 2009 

SPARSISTENCY AND RATES OF CONVERGENCE IN LARGE 
COVARIANCE MATRIX ESTIMATION^ 

By Clifford Lam and Jianqing Fan 

London School of Economics and Political Science and Princeton 
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This paper studies the sparsistency and rates of convergence for 
estimating sparse covariance and precision matrices based on penal- 
ized hkehhood with nonconvex penahy functions. Here, sparsistency 
refers to the property that aU parameters that are zero are actually 
estimated as zero with probability tending to one. Depending on the 
case of applications, sparsity priori may occur on the covariance ma- 
trix, its inverse or its Cholesky decomposition. We study these three 
sparsity exploration problems under a unified framework with a gen- 
eral penalty function. We show that the rates of convergence for these 
problems under the Frobenius norm are of order (s„ log j)„/n)'^'^^, 
where s„ is the number of nonzero elements, p„ is the size of the co- 
variance matrix and n is the sample size. This explicitly spells out the 
contribution of high-dimensionality is merely of a logarithmic factor. 
The conditions on the rate with which the tuning parameter A„ goes 
to have been made explicit and compared under different penalties. 
As a result, for the Li-penalty, to guarantee the sparsistency and op- 
timal rate of convergence, the number of nonzero elements should be 
small: s'^ = 0(pn) at most, among O(p^) parameters, for estimating 
sparse covariance or correlation matrix, sparse precision or inverse 
correlation matrix or sparse Cholesky factor, where s'n is the num- 
ber of the nonzero elements on the off-diagonal entries. On the other 
hand, using the SCAD or hard-thresholding penalty functions, there 
is no such a restriction. 

1. Introduction. Covariance matrix estimation is a common statistical 
problem in many scientific applications. For example, in financial risk as- 
sessment or longitudinal study, an input of covariance matrix S is needed, 
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whereas an inverse of the covariance matrix, the precision matrix 5]~ , is re- 
quired for optimal portfoho selection, linear discriminant analysis or graphi- 
cal network models. Yet, the number of parameters in the covariance matrix 
grows quickly with dimensionality. Depending on the applications, the spar- 
sity of the covariance matrix or precision matrix is frequently imposed to 
strike a balance between biases and variances. For example, in longitudi- 
nal data analysis [see, e.g., Diggle and Verbyla (1998), or Bickel and Levina 
(2008b)], it is reasonable to assume that remote data in time are weakly cor- 
related, whereas in Gaussian graphical models, the sparsity of the precision 
matrix is a reasonable assumption [Dempster (1972)]. 

This initiates a series of researches focusing on the parsimony of a co- 
variance matrix. Smith and Kohn (2002) used priors which admit zeros 
on the off-diagonal elements of the Cholesky factor of the precision matrix 
n = while Wong, Carter and Kohn (2003) used zero-admitting prior 
directly on the off-diagonal elements of to achieve parsimony. Wu and 
Pourahmadi (2003) used the Modified Cholesky Decomposition (MCD) to 
find a banded structure for nonparametrically for longitudinal data. Bickel 
and Levina (2008b) developed consistency theories on banding methods for 
longitudinal data, for both I] and ft. 

Various authors have used penalized likelihood methods to achieve parsi- 
mony on covariance selection. Fan and Peng (2004) has laid down a general 
framework for penalized likelihood with diverging dimensionality, with gen- 
eral conditions for the oracle property stated and proved. However, it is 
not clear whether it is applicable to the specific case of covariance matrix 
estimation. In particular, they did not link the dimensionality pn with the 
number of nonzero elements in the true covariance matrix or the 
precision matrix Qq. A direct application of their results to our setting can 
only handle a relatively small covariance matrix of size pn = o{n^^^^). 

Recently, there is a surge of interest on the estimation of sparse covari- 
ance matrix or precision matrix using penalized likelihood method. Huang 
et al. (2006) used the LASSO on the off-diagonal elements of the Cholesky 
factor from MCD, while Meinshausen and Biihlmann (2006), d'Aspremont, 
Banerjee and El Ghaoui (2008) and Yuan and Lin (2007) used different 
LASSO algorithms to select zero elements in the precision matrix. A novel 
penalty called the nested LASSO was constructed in Levina, Rothman and 
Zhu (2008) to penalize off-diagonal elements. Thresholding the sample co- 
variance matrix in high-dimensional setting was thoroughly studied by El 
Karoui (2008) and Bickel and Levina (2008a) and Cai, Zhang and Zhou 
(2008) with remarkable results for high-dimensional applications. However, 
it is not directly applicable to estimating sparse precision matrix when the 
dimensionality pn is greater than the sample size n. Wagaman and Levina 
(2008) proposed an Isomap method for discovering meaningful orderings of 
variables based on their correlations that result in block-diagonal or banded 
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correlation structure, resulting in an Isoband estimator. A permutation in- 
variant estimator, called SPICE, was proposed in Rothman et al. (2008) 
based on penalized likelihood with Li-penalty on the off-diagonal elements 
for the precision matrix. They obtained remarkable results on the rates of 
convergence. The rate for estimating ft under the Frobenius norm is of or- 
der (s„ logp„/n)^/^, with dimensionality cost only a logarithmic factor in 
the overall mean-square error, where Sn = Pn + Snii Pn is the number of the 
diagonal elements and Sni is the number of the nonzero off-diagonal en- 
tries. However, such a rate of convergence neither addresses explicitly the 
issues of sparsistency such as those in Fan and Li (2001) and Zhao and Yu 
(2006), nor the bias issues due to the Li-penalty and the sampling distri- 
bution of the estimated nonzero elements. These are the core issues of the 
study. By sparsistency, we mean the property that all parameters that are 
zero are actually estimated as zero with probability tending to one, a weaker 
requirement than that of Ravikumar et al. (2007). 

In this paper, we investigate the aforementioned problems using the pe- 
nalized pseudo-likelihood method. Assume a random sample {yi}i<i<n with 
mean zero and covariance matrix Hq, satisfying some sub-Gaussian tails 
conditions as specified in Lemma 2 (see Section 5). The sparsity of the true 
precision matrix fio can be explored by maximizing the Gaussian quasi- 
likelihood or equivalently minimizing 

(1.1) qiifi) = tT{sn) -iog\n\ +J2px,A\^ij\), 

which is the penalized negative log-likelihood if the data is Gaussian. The 
matrix S = n~^J27=iyiyI the sample covariance matrix, = (wjj), and 
Pa„i(0 is a penalty function, depending on a regularization parameter A„i, 
which can be nonconvex. For instance, the Li-penalty p\{9) = \\9\ is con- 
vex, while the hard-thresholding penalty defined by p\{0) = — {\6\ — 
A)^1{|0|<-;^}, and the SCAD penalty defined by 

(1.2) p';^(0) = Al|e<;,} + (aA-^)+l{e>A}/(a-l) for some a > 2, 

are folded-concave. Nonconvex penalty is introduced to reduce bias when the 
true parameter has a relatively large magnitude. For example, the SCAD 
penalty remains constant when is large, while the Li-penalty grows lin- 
early with 9. See Fan and Li (2001) for a detailed account of this and other 
advantages of such a penalty function. The computation can be done via 
the local linear approximation [Zhou and Li (2008) and Fan, Feng and Wu 
(2009)]; see Section 2.1 for additional details. 

Similarly, the sparsity of the true covariance matrix Sq can be explored 
by minimizing 

(1.3) g2(S) = tr(SS-i) + log|I]| + (|a,, |), 
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where S = (cjj j ) . Note that we only penahze the off-diagonal elements of S 
or ri in the aforementioned two methods, since the diagonal elements of So 
and fto do not vanish. 

In studying a sparse covariance or precision matrix, it is important to 
distinguish between the diagonal and off-diagonal elements, since the diag- 
onal elements are always positive and they contribute to the overall mean- 
squares errors. For example, the true correlation matrix, denoted by Tq, 
has the same sparsity structure as Sq without the need to estimating its 
diagonal elements. In view of this fact, we introduce a revised method (3.2) 
to take this advantage. It turns out that the correlation matrix can be es- 
timated with a faster rate of convergence, at (snilogpn/n)^^'^ instead of 
{{pn + Sni) logp„/?i)-^/^, where s„i is the number of nonzero correlation co- 
efficients. We can take similar advantages over the estimation of the true 
inverse correlation matrix, denoted by See Section 2.5. This is an ex- 
tension of the work of Rothman et al. (2008) using the Li-penalty. Such an 
extension is important since the nonconcave penalized likelihood ameliorates 
the bias problem for the Li-penalized likelihood. 

The bias issues of the commonly used Li-penalty, or LASSO, can be seen 
from our theoretical results. In fact, due to the bias of LASSO, an upper 
bounded of A„j is needed in order to achieve fast rate of convergence. On the 
other hand, a lower bound is required in order to achieve sparsity of esti- 
mated precision or covariance matrices. This is in fact one of the motivations 
for introducing nonconvex penalty functions in Fan and Li (2001) and Fan 
and Peng (2004), but we state and prove the explicit rates in the current 
context. In particular, we demonstrate that the Li-penalized estimator can 
achieve simultaneously the optimal rate of convergence and sparsistency for 
estimation of Sq or CIq when the number of nonzero elements in the off- 
diagonal entries are no larger than 0{pn), but not guaranteed so otherwise. 
On the other hand, using the nonconvex penalties like the SCAD or hard- 
thresholding penalty, such an extra restriction is not needed. 

We also compare two different formulations of penalized likelihood using 
the modified Cholesky decomposition, exploring their respective rates of 
convergence and sparsity properties. 

Throughout this paper, we use Ainin(^)5 Amax(^) and tic{A) to denote the 
minimum eigenvalue, maximum eigenvalue, and trace of a symmetric matrix 
A, respectively. For a matrix B, we define the operator norm and the Probe- 
nius norm, respectively, as ||-B|| = XllaxiB^^ B) and ||-B||f = tr^^'^{B^ B). 

2. Estimation of sparse precision matrix. In this section, we present the 
analysis of (1.1) for estimating a sparse precision matrix. Before this, let us 
first present an algorithm for computing the nonconcave maximum (pseudo)- 
likelihood estimator and then state the conditions needed for our technical 
results. 
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2.1. Algorithm based on iterated reweighted Li-penalty. The computa- 
tion of the nonconcave maximum hkehhood problems can be solved by a 
sequence of Li-penalized likelihood problems via local linear approxima- 
tion [Zou and Li (2008) and Fan, Feng and Wu (2009)]. For example, given 
the current estimate ilk = {^ij,k)-, by the local linear approximation to the 
penalty function. 



(2.1) 



gi(n) f«tr(Sfi) - log|f2| 



Hence, ri^+i should be taken to maximize the right-hand side of (2.1): 



(2.2) fifc+i = argmax 



tr(Sn) -logl^l +^pl„^(|u;i,,fe|)|w,,-| 



after ignoring the two constant terms. Problem (2.2) is the weighted penal- 
ized Li-likelihood. In particular, if we take the most primitive initial value 
rio = 0; then 



rii = argmax 



tr(Sn) - log|f2| +A„i^|u;i^ 



is already a good estimator. Iterations of (2.2) reduces the biases of the 
estimator, as larger estimated coefficients in the previous iterations receive 
less penalty. In fact, in a different setup, Zou and Li (2008) showed that one 
iteration of such a procedure is sufficient as long as the initial values are 
good enough. 

Fan, Feng and Wu (2009) has implemented the above algorithm for opti- 
mizing (1.1). They have also demonstrated in Section 2.2 in their paper how 
to utilize the graphical lasso algorithm of Friedman, Hastie and Tibshirani 
(2008), which is essentially a group coordinate descent procedure, to solve 
problem (2.2) quickly, even when p„ > n. Such a group coordinate decent 
algorithm was also used by Meier, van de Geer and Biihlmann (2008) to 
solve the group LASSO problem. Thus iteratively, (2.2), and hence (1.1), 
can be solved quickly with the graphical lasso algorithm. See also Zhang 
(2007) for a general solution to the folded-concave penalized least-squares 
problem. The following is a brief summary of the numerical results in Fan, 
Feng and Wu (2009). 



2.2. Some numerical results. We give a brief summary of a breast can- 
cer data analysis with pn> n considered in Fan, Feng and Wu (2009). For 
full details, please refer to Section 3.2 of Fan, Feng and Wu (2009). Other 
simulation results are also in Section 4 in their paper. 
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Breast cancer data. Normalized gene expression data from 130 patients 
with stage I-III breast cancers are analyzed, with 33 of them belong to 
class 1 and 97 belong to class 2. The aim is to assess prediction accuracy 
in predicting which class a patient will belong to, using a set of pre-selected 
genes (p„ = 110, chosen by t-tests) as gene expression profile data. The data 
is randomly divided into training (n = 109) and testing sets. The mean vector 
for the genes expression levels is obtained from the training data, as well as 
the associated inverse covariance matrix estimated using LASSO, adaptive 
LASSO and SCAD penalties as three different regularization methods. A 
linear discriminant score is then calculated for each regularization method 
and applied to the testing set to predict if a patient belongs to class 1 or 2. 
This is repeated 100 times. 

On average, the estimated precision matrix fi using LASSO has many 
more nonzeros than that using SCAD (3923 versus 674). This is not sur- 
prising when we look at (2.3) in our paper, where the Li-penalty imposes 
an upper bound on the tuning parameter A^i for consistency, which links 
to reducing the bias in the estimation. This makes the Ani in practice too 
small to set many of the elements in fl to zero. While we do not know which 
elements in the true are zero, the large number of nonzero elements in 
the Li-penalized estimator seems spurious, and the resulting gene network 
is not easy to interpret. 

On the other hand, SCAD-penalized estimator has a much smaller number 
of nonzero elements, since the tuning parameter A„i is not bounded above 
under consistency of the resulting estimator. This makes the resulting gene 
network easier to interpret, with some clusters of genes identified. 

Also, classification results on the testing set using the SCAD penalty 
for precision matrix estimation is better than that using the Li-penalty, in 
the sense that the specificity (#True Negative/#class 2) is higher (0.794 
to 0.768) while the sensitivity (#True Positive/#class 1) is similar to that 
using Li-penalized precision matrix estimator. 

2.3. Technical conditions. We now introduce some notation and present 
regularity conditions for the rate of convergence and sparsistency. 

Let Si = -uj^j 7^ 0}, where = (u^ij) is the true precision matrix. 

Denote by Sni = \Si\ — Pn, which is the number of nonzero elements in the 
off-diagonal entries of CIq. Define 

a„i = max p'x^,{\uJ^j\), bni = max p^^^ {\uj%\). 

The term a.„i is related to the asymptotic bias of the penalized likelihood 
estimate due to penalization. Note that for Li-penalty, a„i = A„i and 6^1 = 
0, whereas for SCAD, a^i = 6ni = for sufficiently large n under the last 
assumption of condition (B) below. 

We assume the following regularity conditions: 
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(A) There are constants ri and T2 such that 

< n < Amin(5]o) < Aniax(So) < T2 < OO for all Tl. 

(B) a„i = 0({(l+p„/(s„i + l))}(logp„/n)i/2)^ ^^(1)^ and min(ij)gsjw0.|/ 
A„i ^ 00 as 71 ^ 00. 

(C) The penalty px{-) is singular at the origin, with limt 10 px{t) / {Xt) = k> 
0. 

(D) There are constants C and D such that, when 61,62 > CA„i, (^i) — 
p'iJ62)\<D\6^-62\. 

Condition (A) bounds uniformly the eigenvalues of SIq, which facilitates 
the proof of consistency. It also includes a wide class of covariance matrices 
as noted in Bickel and Levina (2008b). The rates a^i and 6^1 in condition 
(B) are also needed for proving consistency. If they are too large, the bias 
due to penalty can dominate the variance from the likelihood, resulting in 
poor estimates. 

The last requirement in condition (B) states the rate at which the nonzero 
parameters should be distinguished from zero asymptotically. It is not explic- 
itly needed in the proofs, but for asymptotically unbiased penalty functions, 
this is a necessary condition so that a„i and 6„i are converging to zero 
fast enough as needed in the first part of condition (B). In particular, for 
the SCAD and hard-thresholding penalty functions, this condition implies 
that a„i = bni = exactly for sufficiently large n, thus allowing a flexible 
choice of A^i. For the SCAD penalty (1.2), the condition can be relaxed as 

min{j,j)e5i l/^ni > a. 

The singularity in condition (C) gives sparsity in the estimates [Fan and 
Li (2001)]. Finally, condition (D) is a smoothing condition for the penalty 
function, and is needed in proving asymptotic normality. The SCAD penalty, 
for instance, satisfies this condition by choosing the constant D, independent 
of n, to be large enough. 

2.4. Properties of sparse precision matrix estimation. Minimizing (1.1) 
involves nonconvex minimization, and we need to prove that there exists 
a local minimizer Cl for the minimization problem with a certain rate of 
convergence, which is given under the Frobenius norm. The proof is given in 
Section 5. It is similar to the one given in Rothman et al. (2008), but now 
the penalty function is nonconvex. 

Theorem 1 (Rate of convergence). Under regularity conditions (A)- 
(D), if {pn + Sni)\ogpn/n = ©(A^ J and {pn + Sni){\ogPnY /n = 0(1) for 
some k> \, then there exists a local minimizer Ct such that 11^2 — fiollp = 
Op{{pn + Sni)^ogpn/n} . For the Li -penalty, we only need\ogpn/n = 0{\^i) . 
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The proofs of this theorem and others are relegated to Section 5 so that 
readers can get more quickly what the results are. As in Fan and Li (2001), 
the asymptotic bias due to the penalty for each nonzero parameter is a^i. 
Since we penalized only on the off-diagonal elements, the total bias induced 
by the penalty is asymptotically of order s„ia,„i. The square of this total 
bias over all nonzero elements is of order Op{{pn + Sni)iogpn/n} under 
condition (B). 

Theorem 1 states explicitly how the number of nonzero elements and 
dimensionality affect the rate of convergence. Since there are {pn + Sni) 
nonzero elements and each of them can be estimated at best with rate n~^/^, 
the total square errors are at least of rate (p„ + Sni)/n. The price that 
we pay for high-dimensionality is merely a logarithmic factor logp„. The 
results holds as long as (p„ -|- s„i)/n is at a rate 0{{logpn)~^) with some 
A: > 1, which decays to zero slowly. This means that in practice pn can be 
comparable to n without violating the results. The condition here is not 
minimum possible; we expect it holds for p^ n. Here, we refer the local 
minimizer as an interior point within a given close set such that it minimizes 
the target function. Following a similar argument to Huang, Horowitz and 
Ma (2008), the local minimizer in Theorem 1 can be taken as the global 
minimizer with additional conditions on the tail of the penalty function. 

Theorem 1 is also applicable to the Li-penalty function, where the local 
minimizer for A„i can be relaxed. In this case, the local minimizer becomes 
the global minimizer. The asymptotic bias of the Li-penalized estimate is 
given in the term Sni^ni = Sni^ni as shown in the technical proof. In order 
to control the bias, we impose condition (B), which entails an upper bound 
on A„i = 0({(1 +Pn/(sni + 1)) logPn/'^}^'^^)- The bias problem due to the 
-Li-penalty for finite parameter has already been unveiled by Fan and Li 
(2001) and Zou (2006). 

Next, we show the sparsistency of the penalized estimator from (1.1). We 
use 5"^ to denote the complement of a set S. 

Theorem 2 (Sparsistency). Under the conditions given in Theorem 1, 
for any local minimizer of (1.1) satisfying \\CI — Q,q\\^p = Op{{pn + Sni) \ogPn/n} 
and ||ri-f2o|P = Op{r]n) for a sequence ofr]n 0, iflogpn/n + rjn = ©(A^J, 
then with probability tending to 1, Uij = for all {i,j) G Sf. 

First, since < ||M|||. for any matrix M, we can always take 7/„ = 

{Pn + Sni)iogPn/n in Theorem 2, but this will result in more stringent re- 
quirement on the number of zero elements when Li-penalty is used, as we 
now explain. The sparsistency requires a lower bound on the rate of the 
regularization parameter A„i. On the other hand, condition (B) imposes an 
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upper bound on A„i when Li-penalty is used in order to control the biases. 
Exphcitly, we need, for Li-penahzed hkehhood, 

(2.3) \ogpn/n + rjn = O(A^i) = (1 +p„/(s„i + 1)) logp„/n 

for both consistency and sparsistency to be satisfied. We present two scenar- 
ios here for the two bounds to be compatible, making use of the inequalities 
ll^llF/Pn ^ II^^IP ^ II^IIf f^"^ ^ matrix M of size pn- 

1. We always have — f2o|| < — f^ollF- In the worst case scenario where 
they have the same order, — riolP = Op{{pn + Sni)logpn/n), so that 
Vn = ipn + Sni) logp„/n. It is then easy to see from (2.3) that the two 
bounds are compatible only when s„i = 0(1). 

2. We also have ||f2 — rio|||'/Pn < ||^^ — ^^o|P- In the optimistic scenario 
where they have the same order, 

- rio||2 = Op((l + Snl/Pn) logPn/n). 

Hence, ?7„ = (1 + s„i/p„) logp„/n, and compatibility of the bounds re- 
quires Snl =0{pn)- 

Hence, even in the optimistic scenario, consistency and sparsistency are guar- 
anteed only when Sni = 0{pn) if the Li-penalty is used, that is, the precision 
matrix has to be sparse enough. 

However, if the penalty function used is unbiased, like the SCAD or the 
hard-thresholding penalty, we do not impose an extra upper bound for Ani 
since its first derivative p'^ ^(|^|) goes to zero fast enough as \9\ increases 
[exactly equals zero for the SCAD and hard-thresholding penalty functions, 
when n is sufficiently large; see condition (B) and the explanation thereof]. 
Thus, A„i is allowed to decay to zero slowly, allowing even the largest order 
Snl = 0{pI). 

We remark that asymptotic normality for the estimators of the elements 
in Si have been established in a previous version of this paper. We omit it 
here for brevity. 

2.5. Properties of sparse inverse correlation matrix estimation. The in- 
verse correlation matrix retains the same sparsity structure of fto. Con- 
sistency and sparsistency results can be achieved with pn as large as logp„ = 
o(n), as long as (s„i -|- l)(logp„)'^/n = 0(1) for some /c > 1 as n — > oo. We 
minimize, w.r.t. ^ = {ipij), 

(2.4) tr(*f s) - log|*| + E^-nid^^jD' 

where Fs = W~^SW~^ is the sample correlation matrix, with = Ds 
being the diagonal matrix with diagonal elements of S, and f„i is a reg- 
ularization parameter. After obtaining can also be estimated by 
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To present the rates of convergence for ^ and CI, we define 

Cni= max p' ( I ijfj I ) , dni= max p'^ ( | -0°- 1 ) , 

where = {''Pij) and modify condition (D) to (D') with A„i there replaced 
by and impose 

(B') c„i = 0({logp„/n}^/2), = 0(1). Also, min(jj)e5j |V'j°l/z^ni ^ 00 as 
n — > 00. 

Theorem 3. Under regularity conditions (A), (B'), (C) and (D'), i/ 
(sni + l)(logp„)''/n = 0(1) for some k>l and + l)logp„/n = 0(1/^^), 
i/ien there exists a local minimizer 'I' for (2.4) such that \\^ — ^oIIf = 
Op{snilogpn/n) and Wfl — floW^ = Op{{sni + l)logpn/n) under the operator 
norm. For the Li-penalty, we only need \ogpn/n = 0{v'^i). 

Note that we can allow pn^ n without violating the result as long as 
logpn/n = 0(1). Note also that an order of {pnlogpn/n}^^'^ is removed by 
estimating the inverse correlation rather than the precision matrix, which 
is somewhat surprising since the inverse correlation matrix, unlike the cor- 
relation matrix, does not have known diagonal elements that contribute 
no errors to the estimation. This can be explained and proved as follows. 
If s„i = 0(j>n), the result is obvious. When Sni = o{pn), most of the off- 
diagonal elements are zero. Indeed, there are at most O(sni) columns of 
the inverse correlation matrix which contain at least one nonzero element. 
The rest of the columns that have all zero off-diagonal elements must have 
diagonal entries 1. These columns represent variables that are actually un- 
correlated from the rest. Now, it is easy to see from (2.4) that these diagonal 
elements, which are one, are all estimated exactly as one with no estimation 
error. Hence, an order of (p„ logp„/n)^/^ is not present even in the case of 
estimating the inverse correlation matrix. 

For the Li-penalty, our result reduces to that given in Rothman et al. 
(2008). We offer the sparsistency result as follows. 

Theorem 4 (Sparsistency). Under the conditions given in Theorem 3, 
for any local minimizer of (2.4) satisfying \\^ — ^oW'f — Op(s„i logp„,/n) 
and II* - *o|P = Op(r/„) for somerjn^O, if log pn/n + r]n = 0{i'^i) , then 
with probability tending to 1, ipij = for all {i,j) S S^. 

The proof follows exactly the same as that for Theorem 2 in Section 2.4, 
and is thus omitted. 

For the Li-penalty, control of bias and sparsistency require Vni to satisfy 
bounds like (2.3): 

(2.5) logPn/n + r]n = O(z^^i) = logpn/n. 
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This leads to two scenarios: 

1. The worst case scenario has 

II* - *of = II* - *0||f = Op{Snl\ogPn/n), 

meaning r]n = Snilogpn/n. Then compatibihty of the bounds in (2.5) 
requires s^i = 0(1). 

2. The optimistic scenario has 

II* - *of = II* - *0|||/Pn = Op{Snl/Pn ' logPn/n), 

meaning rjn = Sni/Pn ■ ^ogpn/n. Then compatibihty of the bounds in (2.5) 
requires s„i = 0(p„). 

On the other hand, for penalties like the SCAD or the hard-thresholding 
penalty, we do not need an upper bound for Sni - Hence, we only need (s^i + 
l)(logp„)'^/n = 0(1) as n — > oo for some A; > 1. It is clear that SCAD results 
in better sampling properties than the Li-penalized estimator in precision 
or inverse correlation matrix estimation. 

3. Estimation of sparse covariance matrix. In this section, we analyze 
the sparse covariance matrix estimation using the penalized likelihood (1.3). 
Then it is modified to estimating the correlation matrix, which improves the 
rate of convergence. We assume that the y^'s are i.i.d. iV(0, SIq) throughout 
this section. 

3.1. Properties of sparse covariance matrix estimation. Let 5*2 = {{i,j) '■ 
a^j / 0}, where So = i^fj)- Denote Sn2 = |'S'2| — Pn, so that Sn2 is the number 
of nonzero elements in Sq on the off-diagonal entries. Put 

a„2 = ^ max p'^^^ ( | o-°- 1 ) , 6„2 = ^ max p'i^^ ( | afj \ ) . 

Technical conditions in Section 2 need some revision. In particular, con- 
dition (D) now becomes condition (D2) with Ani there replaced by An2- 
Condition (B) should now be 

(B2) a„2 = 0({(l+p„/(sn2 + l))logp„/n}i/2)^ = o(l), and min(jj)gS2 |o-°.|/ 
Xn2 ^ 00 as n ^ 00 . 

Theorem 5 (Rate of convergence). Under regularity conditions (A), 
(B2), (C) and (D2), if (p„ + Sn2) {^og pn)^ /n = 0(1) for some k > 1 and 

{pn + Sn2)^ogpn/n = 0{X'^2)> then there exists a local minimizer S such 
that — "SqUp = Op{(j)n + Sn2)^ogpn/n). For the Li-penalty, we only need 
logpn/n = 0{\l2)- 
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When the Li-penalty is used, the condition for A„2 is relaxed to logpn/n = 
0(A^2)- Like the case for precision matrix estimation, the asymptotic bias 
due to the Li-penalty is of order s„2an2 = Sn2^n2- To control this term, for 
the Li-penalty, we require Xn2 = +Pn/(sn2 + 1)) logPn/"'}^'^^)- 

Theorem 6 (Sparsistency). Under the conditions given in Theorem 
5, for any local minimizer S of (1.3) satisfying ||S — ^oWf ~ Op{{pn + 
Sn2)iogpn/n) and ||S - SqIP = Op(r/„) /or some ??„ ^ 0, if \ogpn / n + rjn = 
0(A^2); then with probability tending to 1, dij = for all {i,j) G 5*2 • 

For the Li-penalized likelihood, controlling of bias for consistency to- 
gether with sparsistency requires 

(3.1) logPn/n + r]n = 0(A^2) = (1 +Pn/(Sn2 + 1)) logp.„/?l. 

This is the same condition as (2.3), and hence in the worst case scenario 
where 

||S - Sof = \\^ - Solll = Op{{pn + s„2)logp„/n), 
we need s„2 = 0(1). In the optimistic scenario where 

11^ — SqII^ = ||S — SoIIf/Pti, 

we need s„2 = 0{pn)- In both cases, the matrix XIq has to be very sparse, 
but the former is much sparser. 

On the other hand, if unbiased penalty functions like the SCAD or hard- 
thresholding penalty are used, we do not need an upper bound on A„2 since 
the bias a„,2 = for sufficiently large n. This gives more flexibility on the 
order of Sn2- 

Similar to Section 2, asymptotic normality for the estimators of the ele- 
ments in 5*2 can be proved under certain assumptions. 

3.2. Properties of sparse correlation matrix estimation. The correlation 
matrix Fq retains the same sparsity structure of Sq with known diagonal 
elements. This special structure allows us to estimate Fq more accurately. 
To take advantage of the known diagonal elements, the sparse correlation 
matrix Fq is estimated by minimizing w.r.t. F = (7jj), 

(3.2) tr(F-ifs)+log|F|+5]p,„,(|7.,|), 

where ^^2 is a regularization parameter. After obtaining F, Sq can be esti- 
mated by E = Wf W. 

To present the rates of convergence for F and S, we define 

Cn2= max p'^^^ ( 1 7°. | ) , d„2 = max p'l^^ ( 1 7°. | ) , 



COVARIANCE ESTIMATION WITH PENALIZATION 13 

where Fq = (7^^,•)• We modify condition (D) to (D2') with A„2 there replaced 
by i/„2, and (B) to (B') as foUows: 

(B2') c„2 = 0{{logpn/ny/'^), dn2 = 0(1), and min(jj)g52 hij\/'^n2 ^ 00 as 
n 00. 

Theorem 7. Under regularity conditions (A), (BSf ), (C) and (DSf), if 
{Pn + Sn2){^ogpn)'' /n = 0{l) for some k>l and (s„2 + 1) logPnA^ = ©(f ^g), 
then there exists a local minimizer T for (3.2) such that 

||f - VqW], = Op{Sn2 logpn/n). 

In addition, for the operator norm, we have 

\\t - Sof = Op{{Sn2 + 1) logPn/n}. 

For the Li-penalty, we only need logpn/n = 0(f^2)- 

The proof is sketched in Section 5. This theorem shows that the correlation 
matrix, like the inverse correlation matrix, can be estimated more accurately, 
since diagonal elements are known to be one. 

Theorem 8 (Sparsistency). Under the conditions given in Theorem 7, 
for any local minimizer T of (3.2) satisfying \\T — Tq\\'p = O p{sn2^ogpn/ n) 
and ||r - Top = Op{r]n) for some r]n 0, if logp„/n + ?/„ = 0(1^^2)^ ^^^n 
with probability tending to 1, 'jij = for all {i,j) £ 5*2 . 

The proof follows exactly the same as that of Theorem 6 in Section 5, and 
is omitted. For the Li-penalized likelihood, controlling of bias and sparsis- 
tency requires 

(3.3) \ogPn/n + r]n = 0(f^2) = ^ogpn/n. 

This is the same condition as (2.5), hence in the worst scenario where 

||f - Fof = ||f - Folll = Op{Sn2logPn/n), 
we need Sn2 = 0{1). In the optimistic scenario where 

||f - Foll^ = ||f - To\\p/pn = Op{Sn2/Pn ' logp„,/n), 

we need Sn2 = 0{pn). 

The use of unbiased penalty functions like the SCAD or the hard-thresholding 
penalty, similar to results in the previous sections, does not impose an up- 
per bound on the regularization parameter since bias c„2 = for sufficiently 
large n. This gives more flexibility to the order of Sn2 allowed. 
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4. Extension to sparse Cholesky decomposition. Pourahmadi (1999) pro- 
posed the modified Cholesky decomposition (MCD) which faciUtates the 
sparse estimation of ft through penahzation. The idea is to represent zero- 
mean data y = (yi, . . . , yp„)^ using the autoregressive model: 

i-l 

(4.1) = 51 <^*iyi + and TST^ = D, 

where T is the unique unit lower triangular matrix with ones on its diagonal 
and (i, j)th element being —<j)ij for j < i, and D is diagonal with ith element 
being af =var(ej). The optimization problem is unconstrained (since the 
(pij's are free variables), and the estimate for ft is always positive-definite. 

Huang et al. (2006) and Levina, Rothman and Zhu (2008) both used the 
MCD for estimating CIq. The former maximized the log-likelihood (ML) over 
T and D simultaneously, while the latter suggested also a least square ver- 
sion (LS), with D being first set to the identity matrix and then minimizing 
over T to obtain T. The latter corresponds to the original Cholesky decom- 
position. The sparse Cholesky factor can be estimated through minimizing 

(4.2) (ML): g3(T,D)=tr(T^D-^TS) + log|D| + 2^PA„3(|i^il)- 

i<j 

This is indeed the same as (1.1) with the substitution of = T-^D~^T and 
penalization parameter A^s. Noticing that (4.1) can be written as Ty = e, 
the least square version is to minimize tr(£:e:'^) = tr(T"^Tyy^) in the matrix 
notation. Aggregating the n observations and adding penalty functions, the 
least-square criterion is to minimize 

(4.3) (LS): g4(T)=tr(T^TS) + 25]pA„4(|iul)- 

i<j 

In view of the results in Sections 2.5 and 3.2, we can also write the sample 
covariance matrix in (4.2) as S = WFsW and then replace D~"'^/^TW by 
T, resulting in the normalized (NL) version as follows: 

(4.4) (NL): g5(T) = tr(T^Tfs) - 2log\T\ + 2Y,px,J\U,\). 

i<j 

We will also assume the y^'s are i.i.d. A^(0,So) as in the last section. 

4.1. Properties of sparse Cholesky factor estimation. Since all the T's 
introduced in the three models above have the same sparsity structure, let 
S and Sn3 be the nonzero set and number of nonzeros associated with each 
T above. Define 

a„3 = max p'^ ( | i ° • | ) , bns = max p'i ( 1 1° • | ) . 

For (ML), condition (D) is adapted to (D3) with A„i there replaced by A„3. 
Condition (B) is modified as: 
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(B3) a„3 = 0({(l+p„/(s„3 + l))logp„/n}i/2), 5„3 = o(l) and mm(^^,j)es \<P^j\/ 
Xn3 — > oo as n — > oo. 

After obtaining T and D from minimizing (ML), we set f2 = T^D~^T. 

Theorem 9. Under regularity conditions (A), (B3), (C), (D3), if{pn + 
Sn3)(logp„)''/n = 0(1) for some k > 1 and (p„ + Sn3)logp„/n = 0{Xl^), 
then there exists a local minimizer T and D for (ML) such that ||T — 
ToIIf = Op(sn3logp„/n), ||D - Dolll = Op(pnlogPn/'ra) and ||f2-Jlo|||' = 
Op{{pn + Sn3)iogpn/n}. For the Li -penalty, we only need logpn/n = 0{X'^^) . 

The proof is similar to those of Theorems 5 and 7 and is omitted. The 
Cholesky factor T has ones on its main diagonal without the need for esti- 
mation. Hence, the rate of convergence is faster than ft. 

Theorem 10 (Sparsistency). Under the conditions in Theorem 9, for 
any local minimizer T, D of (4-2) satisfying ||T — To|||' = Op (s„3 log p„,/n) 
and ||D-Do||p = Op (p„ log p„/n), iflogpn/n + rjn + Cn = 0{Xl^)^ then spar- 
sistency holds for T, provided that ||T — TqIP = Op{r]n) and ||D — DqP = 
Op{Cn), for some Cn ^ 0. 

The proof is in Section 5. For the Li-penalized likelihood, control of bias 
and sparsistency impose the following: 

(4.5) logp„/n + r]n + Cn = 0{Xl.^) = (1 +p„/(s„3 + 1)) logp„/n. 

The worst scenario corresponds to r/„ = s„3logp„/n and Cn = Pn^ogpn/n, 
so that we need s„3 = 0(1). The optimistic scenario corresponds to rjn = 
Sns/Pn ■ ^ogpn/n and Cn = logPn/f^, SO that we need s„3 = 0(p„). 

On the other hand, such a restriction is not needed for unbiased penalties 
like the SCAD or hard-thresholding penalty, giving more flexibility on the 
order of s^a- 

4.2. Properties of sparse normalized Cholesky factor estimation. We now 
turn to analyzing the normalized penalized likelihood (4.4). With T = [tij) 
in (NL) which is lower triangular, define 

a„5 = max p'^,^^(|4|), 6„5 = max p'i^^(|t°.|). 

Condition (D) is now changed to (D5) with A.„i there replaced by A„5. 
Condition (B) is now substituted by 

(B5) = 0(logp„/n), 6„5 = o(l), min(jj)g5. \t^/Xn5 ^ oo as cx). 
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Theorem 11 (Rate of convergence). Under regularity conditions (A), 
(B5), (C) and (D5), if Snsilogpn)'^ /n = 0(1) for some k > I and {sns + 
1) log j>„/n = 0(A^5), then there exists a local minimizer T for (NL) such 
that ||T — TqIII' = Op{sn3 logpn/n) and rate of convergence in the Frobenius 
norm 

\\ft - QoWf = Op{{pn + s„3)logp„/n} 
and in the operator norm, it is improved to 

- f2o|P = Op{{Sn3 + 1) logPn/n}. 

For the Li-penalty, we only need logpn/n = 0{X'^^). 

The proof is similar to that of Theorems 5 and 7 and is omitted. In this 
theorem, like Lemma 3, we can have pn so that Pn/n goes to a constant less 
than 1. It is evident that normalizing with W results in an improvement in 
the rate of convergence in operator norm. 

Theorem 12 (Sparsistency). Under the conditions given in Theorem 
11, for any local minimizerT of (4-4) satisfying ||T — To||p' = Op (sns log p„/n) 
if log Pn/n + r]n = 0{Xf^r^) , then sparsistency holds forT, provided that ||T — 
Top = 0{rin) for some rjn 0. 

Proof is omitted since it goes exactly the same as that of Theorem 10. 
The above results apply also to the Li-penalized estimator. For simultaneous 
persistency and optimal rate of convergence using the Li-penalty, the biases 
inherent in it induce the restriction Sns = 0(1) in the worst scenario where 
Vn — •Snslogpn/n, and s„3 = 0{pn) in the optimistic scenario where r/,^ = 
SnzlVn ■ logPn/"-- This restriction does not apply to the SCAD and other 
asymptoticaUy unbiased penalty functions. 

5. Proofs. We first prove three lemmas. The first one concerns with 
inequalities involving the operator and the Frobenius norms. The other 
two concern with order estimation for elements in a matrix of the form 
A(S — 5]o)B, which are useful in proving results concerning sparsistency. 

Lemma 1. Let A and B he real matrices such that the product AB is 
defined. Then, defining || A||^jj^ = Aniin(A"^A), we have 

(5-1) ||A||min||B||F < ||AB||p < ||A||||B||p. 

In particular, if A = (uij), then \aij\ < ||A|| for each i,j. 
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Proof. Write B = (bi, . . . , b^), where bj is the ith column vector in B. 
Then 



1 

i=l i=l 



\AB\\l = tr(B^A^AB) = ^ bf A^Ab, < A„,ax(A^A) ^ \\hi 
= IIAlPllBl 



If- 



Similarly, 



II ABIll = ^ bf A^Abi > A,ni„(A^A) ^ ||bi 

i=l 1=1 

II A ||2 ||"R||2 

— ll-^llminll-'-'llF) 

which completes the proof of (5.1). To prove |ajj| < ||A||, note that a. 
ejAej , where ej is the unit column vector with one at the ith position, and 
zero elsewhere. Hence, using (5.1), 

|o«il = l^f Ae^l < ||Aej||i? < ||A|| • llejUi? = ||A||, 

and this completes the proof of the lemma. □ 

Lemma 2. Let S be a sample covariance matrix of a random sample 
{yi}i<i<n, with E{yi) = and var(yi) = Sq. Let yi = {yn, . . . ,yip^) with 
yij ~ Fj, where Fj is the c.d.f. of yij, and let Gj be the c.d.f. of yfj, with 

roc 

(5.2) max / exp(At) dG,(t) < oo, 0<|A|<Ao, 

l<i<Pn JQ 

for some Aq > 0. Assume logpn/n = o(l), and that Sq has eigenvalues uni- 
formly bounded above as oo. Then for constant matrices A and B with 
||A||, ||B|| = 0(1), we have maxij|(A(S - So)B)ij| = Op({logp„/n}i/2). 

Remark. The conditions on the yjj's above are the same as those used 
in Bickel and Levina (2008b) for relaxing the normality assumption. 

Proof of Lemma 2. Let Xj = Ay^ and w,/ = B'^yj. Define Uj = (xf , wf )^, 
with covariance matrix 

„ f . f AEoA^ ASoB 

S„ = var(u,,)=(^3T5.^^T B^SqB 

Since ||(A^B)^|| < (||Af + ||Bf )i/2 = o{l) and ||So|| = 0(1) uniformly, 
we have ||5]u|| = 0(1) uniformly. Then, with Su = n~^J27=i'^i'^T : which is 
the sample covariance matrix for the random sample {uj}i<j<„, by Lemma 
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A. 3 of Bickel and Levina (2008b) which holds under the assumption for the 
Uij^s and logpn/n = o(l), we have 

max|(Su - ^u)ij \ = Op({logp„/n}i/2). 
In particular, it means that 

max|(A(S-5]o)B)..|= |^n-if^^x,w^-ASoB^ = Op{{\ogpn/n}^^^), 

which completes the proof of the lemma. □ 

Lemma 3. Let S be a sample covariance matrix of a random sample 
yii<i<n ^^^^ Yi ~ ^(0) ^o)- Assume Pn/n ^ y £ [0, 1), Sq has eigenvalues 
uniformly bounded as oo, and A = Aq + Ai , B = Bq + A2 are such that 
the constant matrices || Ao||, ||Bo|| = 0(1), with ||Ai||, IIA2II = op(l). Then 
we still have maxjj |(A(S — 5]o)B)ij| = Op({logp„/n}i/^). 

Proof. Consider 

(5.3) A{S--So)B = Ki + K2 + K3 + K4, 

where Ki = Ao(S - 5]o)Bo, K2 = Ai(S - So)Bo, K3 = Ao(S - 5]o)A2 and 
K4 = Ai(S-I]o)A2. Now, maxij \{Ki)ij\ = Op({logp„/n}i/2) Lemma 2. 
Consider K2- Suppose the maximum element of the matrix is at the (i, j)th 
position. Consider ((S — I]o)Bo)ij, the (i, j)th element of (S — So)Bo. Since 
each element in S — Sq has a r ate Op (n- 1/2), the ith row of S — So has a 
norm of Opdpn/n}^^'^). Also, the jth column of Bq has ||Boej|| < ||Bo|| = 
0(1). Hence, ((S - ^o)Bo)ij = Op(K/n}i/2). 

Hence, we can find c„ = o{{n/pn}^^^) such that each element in c„Bq (S — 
So) has an order larger than that in Ai, since ||Ai|| = op(l) implies that 
each element in Ai is also op(l) by Lemma 1. 

Then suitable choice of c„ leads to 

(5.4) max|(Ai(S - So)Bo),,| < c„max|(B^(S - ^ofBo)^^\. 

At the same time. Theorem 5.10 in Bai and Silverstein (2006) implies that, 
for Yi N{0, So) and Pn/n y £ (0, 1), with probability one, 

-2y/y-y < liminf Amin(So^^^SSo^''^ - I) 

n— >oo 

< limsup A^ax(So '/'SS" - I) < 2v^ + y. 

n— *-oo 

Hence, if we have Pn/n = o(l), we must have ||Sq SSq — I|| = op(l), 
or it will contradict the above. It means that ||S — So|| = op(l) since So 
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has eigenvalues uniformly bounded. Or, if Pn/n ^ y G (Oi !)> then we have 
||S - Soil = Op(l) by the above. 

Since S — Sq is symmetric, we can find a rotation matrix Q (i.e., Q"^Q = 
QQ^ = /) so that 

S - So = QAQ^, 

where A is a diagonal matrix with real entries. Then we are free to control c„ 
again so as to satisfy further that Cn||A|p = op(||A||), since ||A|| = ||S — So|| = 
Op{l) at most. Hence, 

c„max|(B;[(S - So)2Bo);-,J = max|(B^Qc„A2Q^Bo)fcfc| 

k k 

<max|(B;5^QAQ^Bo)fcfc| 

k 

= max|(B;^(S-So)Bo),,| 
= Op({logp„/n}V2), 

where the last line used the previous proof for constant matrix Bq. Hence, 
combining this with (5.4), we have maxjjKi('2)jj| = O p {{log Pn / n}^^"^) . Sim- 
ilar arguments go for K^, and K^. □ 

Proof of Theorem 1. The main idea of the proof is inspired by Fan 
and Li (2001) and Rothman et al. (2008). Let [/ be a symmetric matrix of 
size pn, D[/ be its diagonal matrix and R[/ = U — be its off-diagonal 
matrix. Set Afj = an^u + Pn^u- We would like to show that, for a„ = 

(s„i logp„/n)^/^ and /3„ = {pnlogpn/n)^/'^ , and for a set A defined as ^ = 
{U:\\Au\\j, = C!al + Cll3l}, 

P( mf^gi(no + Au) > qiifto)) ^ 1, 

for sufficiently large constants Ci and C2. This implies that there is a local 
minimizer in {flo + Au ■ {{AuHp < Cla^ + Cf/?^} such that — ^Io\\f = 
Op{an + (3n) for sufficiently large n, since VLq + Au is positive definite. This 
is shown by noting that 

Amm(^0 + Au) > Amin(^o) + Amm(A(/) > Amin(^^o) " 11^(711^ > 0, 

since Qq has eigenvalues uniformly bounded away from and oo by condition 
(A), and || Ac/||f = 0(a„ + /3„) = o(l). 

Consider, for S = Sq + Au, the difference 



qi{n) - qiiQo) = h + h + h, 
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where 



h = tr(Sn) - logl^l - (tr(Sno) - log|f^o|), 

^2= E (PA„.(k.,|)-PA„.(l4l))' 

h= E (PA„,(|^.,|)-PA„,(|4|)). 
{i,j)&Si,i=^j 

It is sufficient to show that the difference is positive asymptotically with 
probability tending to 1. Using Taylor's expansion with the integral remain- 
der, we have /i = Ki + K2, where 



(5.5) 



Ki=tr((S-So)Ac7), 

K2 = vec(Aj;)^|^ g{v, - v) dv^ vec{Au) 

with the definitions Cly = CIq + vAu, and g{v, fly) = 8 fl^^- Now, 
K2> f (l-v) min X^i,,{ft:;;^ (g) ft~^) dv \\vec{Au)f 

Jo 0<v<l 

= ||vec(A[;)||V2- min X^^^i^l.) 



0<?J<1 



>||vec(Ac/)||V2-(||f^o|| + ||A 



u\ 



>(C?a2+C|/?2)/2.(rfl + o(l))-^ 

where we used \\Au\\ < Ciq„ + C2Pn = 0((logPn)*-^~^''^^) = o{l) by our as- 
sumption. 

Consider Ki. It is clear that \Ki\ < Li + L2, where 



Li 



E (S - ^o)ji(A[/)ij 

E ~'^o)ij{^u)ij 



Using Lemmas 1 and 2, we have 

Li < (s„i -hp„)^/^max|(S - X;o)ij| • \\Au\\f 

< Op(a„ + /3n)- II At; IliT 
= Op(C7ia2+C72/32). 

This is dominated by K2 when Ci and C2 are sufficiently large. 
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Now, consider I2 — L2 for penalties other than Li. Since || A(/||^ = Cfa'^ + 
Cf/?^ on A, we have that \ujij\ = 0(Cia„ + C2/3n) = o(l) for ah (i, j) G Sf. 
Also, note that the condition on A^i ensures that, for G Sf, \u)ij\ = 
0{an + Pn) = o(A„i). Hence, by condition (C), for all (i, j) E 5f , we can find 
a constant /ci > such that 

P\nli\'^ij\) ^ Ari,lA;i|(Jij|. 

This implies that 

h= PA„i(kiil) > Anl/Ci ^ \uJij\. 

Hence, 

h-L2> ^ {Ani/cil^ijl - |(S - So)ij| • I'^jjl} 

> E [Kiki-Op{{\o^Pn/nfl^)]-\uj,j\ 

= Xni E [^i-Op(A;:iHiogPnMi/2)].|^^.|_ 

With the assumption that {pn + Sni)logp„/n = 0(A,^]^), we see from the 
above that I2 — L2>0 since Op{X~l{logpn/n}^^^) = op(l), using logp„/n = 

0((Pn + Snl) logpn/n) = o(A^J. 

For the Li-penalty, since we have maxj^j |S — SIqI = Op((logp„/n)^/^) by 
Lemma 2, we can find a positive W = Op{l) such that 

max|S - Eol = VF(logp,,/n)^/^ 

Then we can set A„i = 2W {log Pn/n)^^^ or one with order greater than 
(logpri/n)^/^, and the above arguments are still valid, so that I2 — L2> 0. 

Now, with Li dominated by K2 and I2 — L2> 0, the proof completes if 
we can show that is also dominated by K2, since we have proved that 
K2 > 0. Using Taylor's expansion, we can arrive at 

I/3I < min(Ci, C2)-i • 0(1) • {Clal + C^fil) + o(l) • {CWn + Cl^l), 

where o(l) and 0(1) are the terms independent of Ci and C2. By condition 
(B), we have 

I/3I = C ■ 0{al + (il) + C2 • o{al + (51), 

which is dominated by K2 with large enough constants Ci and C2. This 
completes the proof of the theorem. □ 



22 C. LAM AND J. FAN 

Proof of Theorem 2. For ft a minimizer of (1.1), the derivative for 
qi{fl) w.r.t. ujij for G S2 is 

dqi{n) 



2{sij-aij +p\^^{\uJij\)sgn{ujij)), 



where sgn(a) denotes the sign of a. If we can show that the sign of dqi{ft) / du>ij 
depends on sgn(wij) only with probabihty tending to 1, the optimum wih 
be at 0, so that LOij = for all G S2 with probability tending to 1. We 
need to estimate the order of Sij — uij independent of i and j . 
Decompose Sij — aij = I1 + I2, where 

J^l — Sij — Cr^j, J^2 — CTij — CTij- 

By Lemma 2 or Lemma A. 3 of Bickel and Levina (2008b), it follows that 
maxjj- = Op{{logpn/n}^^^). It remains to estimate the order of l2- 
By Lemma 1, \aij — cr^j\ < ||S — ^o\\, which has order 

llS-Eoll = ||I](n-no)So|| < ||S|| • llfi-fioll • W^oW =0{\\ft-flo\\), 
where we used condition (A) to get ||So|| = 0(1), and using rjn^O so that 
Ai„in(n - fio) = 0(1) for \\n - fioll = 0(r?y^), 

WE\\ = x;^l{n) < (Xn^u^o) + x^i^{n - no)r' 

= (0(l) + o(l))-i = 0(l). 

Hence, ||ri — rio|| = 0{r]n'^) implies I/2I = 0{r]n'^). 
Combining the last two results yields that 

max|sij - o-jjl = Op{\sij - o-° | + r/^^) 

= Op({logp,/n}V2 + ^i/2). 
By conditions (C) and (D), we have 

P'\r,ii\^ij\) = C2.Xnl 

for ojij in a small neighborhood of (excluding itself) and some positive 
constant C3. Hence, if uJij lies in a small neighborhood of 0, we need to have 
logpn/n + r]n = O(A^i) in order to have the sign of dqi{Q)/duJij depends 
on sgn(a;jj) only with probability tending to 1. The proof of the theorem is 
complet. □ 

Proof of Theorem 3. Because of the similarity between (2.4) and 
(1.1), the Frobenius norm result has nearly identical proof as Theorem 1, 
except that we now set Ajy = a„C/. For the operator norm result, we refer 
readers to the proof of Theorem 2 of Rothman et al. (2008). □ 
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Proof of Theorem 5. The proof is similar to that of Theorem 1. We 
only sketch briefly the proof, pointing out the important differences. 

Let On = {sn2'^ogpn/n)'^/'^ and /?„ = (pnlogpn/n)'^/'^ , and define A = {U: 
\\Au\\f = Cfal + C^Pl}. Want to show 

P( inf (72(5]o + Au) > g2(So)) ^ 1 

for sufficiently large constants Ci and C2. 
For S = So + Afj, the difference 



q2CS)-q2CSo)=h+l2 + h, 



where 



/i =tr(5n) + log|S| - (tr(5no) +log|So|), 
^2= E (PA„.(k,,|)-pA„.(l4l))' 

with Ii = Ki + K2 , where 

Ki = - tr((S - So)noAc/no) = - tr((Sao - ^o)^u), 



(5.6) 



K2 = vec(Ac/)^|^ g{v, S„)(l - v) dv^ vec(Af/), 



and = So + vAjj, Sfj^ is the sample covariance matrix of a random 
sample {xj}i<j<„ having Xj A^(0,rio)- Also, 

(5.7) g{v, S„) = S-i ® S-^SS-i + S-^SS"! S"^ - S^^ S^^ 

The treatment of K2 is different from that in Theorem 1. By condition 
(A), and {pn + s„2)(logPn)^/?^ = 0(1) for some > 1, we have 

||7;Ac;f^o|| < ||Ac/|| llfioll < rf ^(Cia„ + C2/?„) = 0((logp„)i-'=) = o(l). 

Thus, we can use the Neumann series expansion to arrive at 

S-i = fto{I + vAuno)-^ = no{I- vAuf^o + 0(1)), 

where the little o (or op, O or Op in any matrix expansions in the remainder 
of this proof) represents a function of the L2 norm of the residual matrix 
in the expansion. That is, S~"^ = flo + Op{an + Pn), and ||S~^|| = t^^ + 
Op {an + Pn)- With S/ defined as the sample covariance matrix formed from 
a random sample {xj}i<j<„ having Xj N{0,I), 

lis- Soil = Op(||S7-/||)=op(l) 
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(see arguments in Lemma 3). These entail 

ss-i = (s - So)s;;^ + 5]oS;;i = op(i) + / + Op{an + Pn) = i + op{i). 

Combining these results, we have 

g{v, S^) = fio f^o + Op{an + Pn)- 

Consequently, 

K2 = vec(Ac/)^|^ ^0 <S) ^0(1 + op(l))(l - v) dv^ vec{Au) 

> A^i„(no ® ^o)\\ vec(Ac/)|| 2/2 • (1 + op(1)) 
= rr2(C72a2+c|/32)/2.(l + op(l)). 

All other terms are dealt with similarly as in the proof of Theorem 1 , and 
hence we omit them. □ 

Proof of Theorem 6. The proof is similar to that of Theorem 2. We 
only show the main differences. 
It is easy to show 

Odij 

Our aim is to estimate the order of |(— riSfi + ri)jj|, finding an upper bound 
which is independent of both i and j. 
Write 

-nsn + n = /i + /2, 

where h = -ft{S - So)^ and h = ^(S - So)^. Since 

ll^ll — ^mL(^) — (-^min(^o) + Amin(S — Sq)) ^ 

we have 

= fio + {ft - fio) = fto - - 5]o)no = f^o + A, 

1 /2 

where ||A|| < ||ri|| • ||S — Soil ■ ||^o|| = 0{r]n ) = o(l) by Lemma 1, with 
||S — So||2 = 0{r]n)- Hence, we can apply Lemma 3 and conclude that 
maxij |(/i)ij| = Opdlogpn/ny/"^). 
For I2, we have 

max|(/2).,| < ll^ll • IIS - Soil • 11^11 = 0(||S - So||) = 0{r]l/^). 
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Hence, we have 

max|(-nsn + = 0({logp„/n}i/2 + ^1/2), 

The rest goes similar to the proof of Theorem 2, and is omitted. □ 

Proof of Theorem 7. The proof is nearly identical to that of Theorem 
5, except that we now set A^/ = UnU. The fact that (rs)w = 1 = jfi has no 
estimation error eliminates an order (p„ logp^/n)^/^ that contributes from 
estimating tr((rs — Tq)^qAi/^q) for (3.2). This is why we can estimate a 
sparse correlation matrix more accurately. 

For the operator norm result, we refer readers to the proof of Theorem 2 
of Rothman et al. (2008). □ 



Proof of Theorem 10. For (T, D) a minimizer of (4.2), the derivative 
for g3(T,D) w.r.t. for £ is 

^^^^iHl = 2((ST^D-i),, + p',„3(|t,,|)sgn(t,,)). 

Now ST'^D^i = Ii + l2+l3 + h, where 

/i = (S - So)T^D-\ /2 = So(T-To)^D-\ 

/3 = SoT^(D"i-Doi), /4 = SoT;^Do-\ 

By the MCD (4.1), h = Tg ^ Since i > j for £ 5|, we must have 

(To^)ji = 0. Hence, we can ignore I4. 

Since ||T - Top = 0(r/„) and ||D - DqP = 0(Cn) with ?/„,Cn = o(l), and 
by condition (A) we can easily show ||D~^ — Dq ^|| =0(||D — Do||) = 0{Cn'^)- 
Then we can apply Lemma 3 to show that maxjj|(/i)jj| = (logp„/n)^/^. 

For I2, we have maxj^l (/2)jj| < \\^o\\ ■ ||T - To|| • ||D"i|| = ©(r/^^). And 

finally, max^.-j (/3)iil < ||5^o|| • ||To|| • - Dq i|| = ©(C^ )• 

With all these, we have max^j ^jg^^l (ST^D~^)jjp = logp„/n + r/„, + Cn- 
The rest of the proof goes like that of Theorems 2 or 6. □ 
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